Homework 3 - Michał Gromadzki

Importing libraries

Loading dataset

EDA and Preprocessing

Checking for nulls.

No nulls.

Encoding categorical features.

Checking correlation.

A strong correlation with the target (charges) is observed only for smoking.
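The check behind this observation can be sketched in plain Python. This is a minimal illustration, not the notebook's actual code: the rows below are made up, the column names `smoker`, `age` and `charges` are assumptions, and the yes/no encoding to 1/0 is one possible choice.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up toy rows, NOT taken from the homework dataset.
smoker = ["yes", "no", "no", "yes", "no", "yes"]
age = [30, 25, 50, 45, 40, 35]
charges = [30000.0, 4000.0, 6000.0, 35000.0, 5000.0, 28000.0]

# Encode the categorical feature as 0/1 before computing the correlation.
smoker_encoded = [1 if s == "yes" else 0 for s in smoker]

corr_smoker = pearson(smoker_encoded, charges)
corr_age = pearson(age, charges)
print(f"smoker vs charges: {corr_smoker:.2f}")
print(f"age vs charges:    {corr_age:.2f}")
```

On a real data frame the same comparison would normally be read off a full correlation matrix; the hand-written `pearson` helper is used here only to keep the sketch free of dependencies.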

Models

LinearRegression

Forest

XGBoost

Homework

Selecting observation

Ceteris Paribus profiles

LinearRegression

As the CP profiles show, the features with a significant impact on the prediction are age, BMI and smoker. The smoker profile is the steepest, which suggests that this feature has the biggest impact on the prediction.
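To make the idea of a Ceteris Paribus profile concrete, here is a minimal model-agnostic sketch: one feature of the selected observation is varied over a grid while all other features stay fixed. The helper `ceteris_paribus`, the coefficients and the observation are hypothetical and are not the fitted values from this notebook.

```python
def ceteris_paribus(predict, observation, feature, grid):
    """Return (value, prediction) pairs obtained by varying one feature
    of `observation` over `grid` while all other features stay fixed."""
    profile = []
    for value in grid:
        modified = dict(observation)  # copy, so the original is untouched
        modified[feature] = value
        profile.append((value, predict(modified)))
    return profile

# Hypothetical linear model with made-up coefficients (not the fitted ones).
COEFS = {"age": 250.0, "bmi": 300.0, "smoker": 24000.0}
INTERCEPT = -12000.0

def linear_predict(row):
    return INTERCEPT + sum(COEFS[name] * row[name] for name in COEFS)

observation = {"age": 40, "bmi": 28.0, "smoker": 0}

bmi_profile = ceteris_paribus(linear_predict, observation, "bmi", [20.0, 25.0, 30.0, 35.0])
smoker_profile = ceteris_paribus(linear_predict, observation, "smoker", [0, 1])

for value, prediction in bmi_profile:
    print(f"bmi={value:5.1f} -> {prediction:9.1f}")
```

For a linear model every profile is a straight line whose slope equals the feature's coefficient, which is why steepness can be read as impact here, provided the features are compared on comparable scales.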

Forest

The main conclusions are the same as in the example above. Additionally, we can see that predicted charges increase dramatically at BMI=30. There are also more irregularities in the plots, which could not appear in the first example because a linear model can only produce straight-line profiles. Moreover, it is easier to distinguish between continuous and discrete features.
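The jump can be reproduced with a toy piecewise-constant predictor, which is how tree-based models behave: the prediction changes only when a feature crosses a split point. The function `step_predict`, its split at BMI 30 and all amounts are made up for illustration and are not taken from the fitted forest.

```python
def step_predict(row):
    """Hypothetical piecewise-constant predictor standing in for a tree-based model."""
    prediction = 5000.0 + 200.0 * (row["age"] // 10)
    if row["bmi"] >= 30.0:  # made-up split point mirroring the jump seen in the plots
        prediction += 18000.0
    return prediction

observation = {"age": 40, "bmi": 28.0}
bmi_grid = [26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0]

# CP profile for BMI: vary BMI only, keep the rest of the observation fixed.
profile = [(bmi, step_predict({**observation, "bmi": bmi})) for bmi in bmi_grid]

# Locate the largest change between neighbouring grid points.
jumps = [(b2, p2 - p1) for (b1, p1), (b2, p2) in zip(profile, profile[1:])]
jump_at, jump_size = max(jumps, key=lambda item: abs(item[1]))
print(f"largest jump of {jump_size:.0f} when BMI reaches {jump_at}")
```

Between split points the profile is flat, so the plot looks like a staircase rather than a line.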

XGBoost

The differences between the second and third models are much smaller than between the first and the second. The jump in prediction at BMI=30 is still visible. There are even more irregularities in the plots than in the second example. We can suppose that this happens because the third model is more complex than the second one. Additionally, for the first time we can see different predictions based on sex.

Comments

LinearRegression

As the CP profiles show, the features with a significant impact on the prediction are age, BMI and smoker. The smoker profile is the steepest, which suggests that this feature has the biggest impact on the prediction. Sex and region appear to have little to no impact on the prediction.

Forest

The main conclusions are the same as in the example above. Additionally, we can see that predicted charges increase dramatically at BMI=30. There are also more irregularities in the plots, which could not appear in the first example because a linear model can only produce straight-line profiles. Moreover, it is easier to distinguish between continuous and discrete features. Furthermore, the prediction is about 8% off the correct value, while the first model is about 40% off.

XGBoost

The main conclusions are the same as in the example above. Additionally, we can see that predicted charges increase dramatically at BMI=30. There are also more irregularities in the plots, which could not appear in the first example because a linear model can only produce straight-line profiles. Moreover, it is easier to distinguish between continuous and discrete features. Furthermore, the prediction is much closer to the correct value: it is only about 1.5% off.
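The "percent off" figures quoted in these comments can be read as relative errors for the selected observation. A minimal sketch, assuming that definition: the true charge and the three predictions below are made-up numbers chosen only to reproduce the quoted percentages, and they are not the notebook's actual values.

```python
def relative_error(predicted, actual):
    """Absolute prediction error as a fraction of the true value."""
    return abs(predicted - actual) / abs(actual)

# Made-up numbers for illustration only; not the notebook's values.
actual_charge = 10000.0
predictions = {
    "LinearRegression": 14000.0,
    "Forest": 9200.0,
    "XGBoost": 10150.0,
}

errors = {name: relative_error(pred, actual_charge) for name, pred in predictions.items()}
for name, err in errors.items():
    print(f"{name}: {err:.1%} off")
```

A single observation says little about overall model quality; a metric computed on a held-out set would be needed for that comparison.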